Extract Content (Web Mining)

Synopsis

Extracts content from an HTML document.

Description

This operator extracts textual content from a given HTML document and returns the extracted text blocks as documents. Only text blocks containing at least a given number of words are extracted, which prevents single words (e.g. in navigation bars) from being kept.
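The minimum-length filter described above can be illustrated with a minimal Python sketch. It is not the operator's actual implementation: the function name `extract_text_blocks`, the default of 5 words, the whitespace-based word count, and the choice to treat every tag as a block boundary are all assumptions made for illustration.

```python
import re


def extract_text_blocks(html: str, minimum_text_block_length: int = 5) -> list[str]:
    """Split HTML at tags and keep blocks with enough words.

    Hypothetical helper: every tag is treated as a block divider and
    words are counted by splitting on whitespace (both are assumptions).
    """
    # Cut the markup at every tag, leaving only the text between tags.
    raw_blocks = re.split(r"<[^>]+>", html)
    blocks = []
    for raw in raw_blocks:
        text = " ".join(raw.split())  # collapse runs of whitespace
        if len(text.split()) >= minimum_text_block_length:
            blocks.append(text)
    return blocks


html = (
    "<ul><li>Home</li><li>About</li></ul>"
    "<p>This operator extracts textual content from a given HTML document.</p>"
)
# The one-word navigation entries are dropped; the paragraph is kept.
print(extract_text_blocks(html, minimum_text_block_length=5))
```

With a threshold of 5, the single-word list items "Home" and "About" fall below the limit, so only the paragraph text survives.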

Input

document
The document port.

Output

document
The document port.

Parameters

extract_content
Specifies whether content is extracted or not.

minimum_text_block_length
The minimum length (in words/tokens) of text blocks.

override_content_type_information
Specifies whether potentially existing content type information and used encoding information should be overridden using the HTML meta http-equiv tag.

neglegt_span_tags
Specifies whether <span> tags should be neglected or used as text block divider.

neglect_p_tags
Specifies whether <p> tags should be neglected or used as text block divider.

neglect_b_tags
Specifies whether <b> tags should be neglected or used as text block divider.

neglect_i_tags
Specifies whether <i> tags should be neglected or used as text block divider.

neglect_br_tags
Specifies whether <br> tags should be neglected or used as text block divider.

ignore_non_html_tags
Specifies whether tags that are not common HTML tags should be ignored.
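The difference between neglecting a tag and using it as a text block divider can be sketched in Python with the standard library's `html.parser`. This is only an illustration of the idea under stated assumptions, not the operator's code: the class `BlockSplitter`, the function `split_blocks`, and the `neglected_tags` argument are hypothetical names.

```python
from html.parser import HTMLParser


class BlockSplitter(HTMLParser):
    """Collects text blocks; tags in `neglected_tags` do not split blocks."""

    def __init__(self, neglected_tags=frozenset()):
        super().__init__(convert_charrefs=True)
        self.neglected_tags = set(neglected_tags)
        self.blocks = []
        self._current = []

    def _flush(self):
        # Close the current block and store it if it contains any text.
        text = " ".join("".join(self._current).split())
        if text:
            self.blocks.append(text)
        self._current = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.neglected_tags:
            self._flush()  # tag acts as a divider

    def handle_endtag(self, tag):
        if tag not in self.neglected_tags:
            self._flush()  # tag acts as a divider

    def handle_data(self, data):
        self._current.append(data)


def split_blocks(html, neglected_tags=frozenset()):
    parser = BlockSplitter(neglected_tags)
    parser.feed(html)
    parser.close()
    parser._flush()  # store any trailing text
    return parser.blocks


html = "<p>Text mining is <b>very</b> useful.</p>"
print(split_blocks(html))         # <b> divides the sentence into three blocks
print(split_blocks(html, {"b"}))  # <b> is neglected, the sentence stays whole
```

When a tag is used as a divider, short fragments such as the bold word end up in their own blocks and are then likely to be discarded by the minimum text block length; neglecting inline tags keeps the surrounding sentence together.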
